14

The Nature of Living Things

than that of the protein (transcription factor)-based regulation; this is borne out by

the length of “noncoding” DNA (proportional to $r$) increasing quadratically with the length of

coding DNA (proportional to $g$) above the 10 Mb threshold. This raises the question of why protein-

based regulation is used at all, even in prokaryotes, if the RNA-based system is

effective and much less costly, but our present knowledge of RNA-based regulation

seems to be too incomplete to allow this question to be satisfactorily addressed.

DNA Base Composition Heterogeneity

The base composition of DNA is very heterogeneous, 30 which makes stochastic modelling

of the sequence (e.g., as a Markov chain) very problematical. This patchiness

or blockiness is presumed to arise from the processes taking place when DNA is

replicated in mitosis and meiosis (Sect. 14.4.1). It has turned out to be very useful

for characterizing variations between individual human genomes. Much of the

human genome is constituted from “haplotype blocks”, regions of about $10^4$–$10^5$

nucleotides in which a few ($<10$; the average number is 5.5) sequence variants are

said to account for nearly all of the variation in the world human population. The

haplotype “map” is simply a list of the variants for each block. 31
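
The difficulty of modelling a patchy sequence as a single Markov chain can be illustrated with a minimal sketch. The function names `windowed_gc` and `markov_transition_probs`, the window size, and the toy sequence are all assumptions made for illustration; they are not taken from the text or from any particular library. The first function profiles the G+C fraction along the sequence; the second estimates first-order transition probabilities from the whole sequence, which is only meaningful if the composition is roughly uniform.

```python
from collections import Counter


def windowed_gc(seq, window):
    """G+C fraction of each consecutive, non-overlapping window of `seq`."""
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        fractions.append((chunk.count("G") + chunk.count("C")) / window)
    return fractions


def markov_transition_probs(seq):
    """Maximum-likelihood first-order estimates of P(next base | current base)."""
    seq = seq.upper()
    pair_counts = Counter(zip(seq, seq[1:]))   # counts of adjacent base pairs
    totals = Counter(seq[:-1])                 # how often each base has a successor
    return {(a, b): n / totals[a] for (a, b), n in pair_counts.items()}


# Toy sequence (invented): an AT-rich block followed by a GC-rich block.
toy = "ATATTAATAT" * 5 + "GCGGCCGCGC" * 5

gc_profile = windowed_gc(toy, 25)           # composition changes abruptly midway
global_probs = markov_transition_probs(toy)  # one chain fitted to both blocks
```

In this toy case the windowed profile jumps from entirely AT to entirely GC, whereas the single set of transition probabilities averages over both blocks and describes neither well. Real genomic heterogeneity is of course far less clear-cut than this contrived example.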

Haplotypes are essentially long stretches of DNA characterized by a small number

of single-nucleotide polymorphisms (SNPs—pronounced “snips”); that is, mutated

nucleotides. There is an average of about 1 SNP per thousand base pairs in the human

genome; thus, if they were uncorrelated, in a typical 50 000 base pair haplotype block

there would be about $2^{50}$ (or $4^{50}$, depending on whether we are interested in what

the base is mutated to) variants—far more variation than is actually found. Hence,

the pattern of SNPs evinces extremely strong constraint; that is, the occurrences

of individual SNPs are strongly correlated with each other. There is considerable

current interest in trying to correlate haplotype variants with disease, or propensity

to disease (Sect. 26.3). 32
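
The counting argument above can be checked with a few lines of arithmetic. This is only a sketch under the assumptions stated in the text (about 1 SNP per thousand base pairs, a 50 000 base pair block, and statistically independent SNPs); the function names are hypothetical.

```python
def expected_snps(block_length_bp, bp_per_snp=1000):
    """Expected number of SNP sites in a block, at one SNP per `bp_per_snp` base pairs."""
    return block_length_bp // bp_per_snp


def uncorrelated_variants(n_snps, states_per_site=2):
    """Number of distinct block variants if every SNP site varied independently."""
    return states_per_site ** n_snps


n_sites = expected_snps(50_000)                   # 50 SNP sites in the block
two_state = uncorrelated_variants(n_sites)        # 2**50: mutated or not at each site
four_state = uncorrelated_variants(n_sites, 4)    # 4**50: which base is present at each site

# Average number of variants per block quoted in the text.
observed_average = 5.5
excess_factor = two_state / observed_average      # how far independence overshoots
```

Even the smaller count, $2^{50}$, is about $10^{15}$, many orders of magnitude more than the handful of variants per block reported, which is the sense in which the SNPs must be strongly correlated.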

One notes that as much as 98% of the human genome may be identical with that

of the ape; one could equally well state that there is more genetic difference between

man and woman than between man and ape. To actually derive the vast phenotypic

differences between the two from their genomes appears to be as vain a hope as

solving the Schrödinger equation for even a single gene.

As an information-bearing symbolic sequence, the genome is unusual in that it

can operate on itself. The most striking example is furnished by retrotransposons

(a class of transposable elements, whose existence was first proposed by McClintock in

1950). These gene segments inter alia encode a reverse transcriptase enzyme, which

facilitates the making of a DNA copy of the sequence. The duplicate sequence is

then inserted into the genome; the point of insertion may be remote from that of the

30 For example, Karlin and Brendel (1993).

31 See Terwilliger and Hiekkalinna (2006) for a critique of the International HapMap Project.

32 Another curiosity is that certain DNA sequences display extraordinarily long-range ($10^4$ base

pairs or more) correlations (see, e.g., Voss 1992).